I endeavour in this project out of my own volition. It is the first time that I come in contact with this data. I carry no bias before carrying this Exploratory Data Analysis, and the primary goal is to further my knowledge in the methods as well as an understanding of different areas through data analysis.
The full name of the data set is “Inpatient Prospective Payment System (IPPS) Provider Summary for the Top 100 Diagnosis-Related Groups (DRG) - FY2011”. It is own and updated by CMS.
The last update was on May 29, 2014, and the metadata was updated on August 30, 2017. For access to the data set as well as further information, please go to this link.
The Dataset has 12 columns. These are the names and the description of them:
The code and description identifying the MS-DRG. MS-DRGs are a
classification system that groups similar clinical conditions (diagnoses)
and the procedures furnished by the hospital during the stay.
The CMS Certification Number (CCN) assigned to the Medicare certified
hospital facility.
The name of the provider.
The provider’s street address.
The city where the provider is located.
The state where the provider is located.
The provider’s zip code.
The Hospital Referral Region (HRR) where the provider is located.
The number of discharges billed by the provider for inpatient
hospital services.
The provider's average charge for services covered by Medicare
for all discharges in the MS-DRG. These will vary from hospital to
hospital because of differences in hospital charge structures.
The average total payments to all providers for the MS-DRG including
the MS-DRG amount, teaching, disproportionate share, capital, and outlier
payments for all cases. Also included in average total payments are
co-payment and deductible amounts that the patient is responsible for and
any additional payments by third parties for coordination of benefits.
The average amount that Medicare pays to the provider for Medicare's share
of the MS-DRG. Average Medicare payment amounts include the MS-DRG amount,
teaching, disproportionate share, capital, and outlier payments for all
cases. Medicare payments DO NOT include beneficiary co-payments and
deductible amounts norany additional payments from third parties for
coordination of benefits. Note: In general, Medicare FFS claims with
dates-of-service or dates-of-discharge on or after April 1, 2013, incurred
a 2 percent reduction in Medicare payment. This is in response to mandatory
across-the-board reductions in Federal spending, also known as
sequestration. For additional information, <a href= "http://www.cms.gov/Outreach-and-Education/Outreach/FFSProvPartProg/Downloads/
2013-03-08-standalone.pdf">visit</a>.
The initial process of refining the dataset was done in a python script named “u_project_EDA-CMS_python3.ipynb”, the code is presented in “u_project_EDA-CMS_python3.html”; for further details examine the HTML file.
Firstly, we load the packages that we are going to use in this project.
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following object is masked from 'package:gridExtra':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Secondly, we load the dataset from its original location to the data frame df_cms.
Before we continue, we check if there are any null values within the data frame df_cms.
There are no null values in the data frame. Afterwards, we check the number of entries present in this dataset.
## [1] 163065 16
The number of entries is 163065. The number of entries is significant. Therefore, We proceed to make a quick summary of the data.
## DRG.Definition
## 194 - SIMPLE PNEUMONIA & PLEURISY W CC : 3023
## 690 - KIDNEY & URINARY TRACT INFECTIONS W/O MCC : 2989
## 292 - HEART FAILURE & SHOCK W CC : 2953
## 392 - ESOPHAGITIS, GASTROENT & MISC DIGEST DISORDERS W/O MCC : 2950
## 641 - MISC DISORDERS OF NUTRITION,METABOLISM,FLUIDS/ELECTROLYTES W/O MCC: 2899
## 871 - SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W MCC : 2812
## (Other) :145439
## Provider.Id Provider.Name
## Min. : 10001 GOOD SAMARITAN HOSPITAL : 633
## 1st Qu.:110092 ST JOSEPH MEDICAL CENTER : 427
## Median :250007 MERCY MEDICAL CENTER : 357
## Mean :255570 MERCY HOSPITAL : 347
## 3rd Qu.:380075 ST JOSEPH HOSPITAL : 343
## Max. :670077 ST FRANCIS MEDICAL CENTER: 277
## (Other) :160681
## Provider.Street.Address Provider.City
## 100 MEDICAL CENTER DRIVE: 183 CHICAGO : 1505
## 800 WASHINGTON STREET : 166 BALTIMORE : 1059
## 1 MEDICAL CENTER DRIVE : 142 HOUSTON : 950
## 100 HOSPITAL DRIVE : 106 PHILADELPHIA: 898
## 100 MICHIGAN ST NE : 100 BROOKLYN : 877
## 1000 BLYTHE BLVD : 100 SPRINGFIELD : 807
## (Other) :162268 (Other) :156969
## Provider.State Provider.Zip.Code Hospital.Referral.Region.Description
## CA : 13064 Min. : 1040 CA - Los Angeles : 3653
## TX : 11864 1st Qu.:27261 MA - Boston : 2910
## FL : 11155 Median :44309 GA - Atlanta : 2630
## NY : 9178 Mean :47938 TX - Houston : 2577
## IL : 7909 3rd Qu.:72901 PA - Philadelphia: 2554
## PA : 7804 Max. :99835 TX - Dallas : 2427
## (Other):102091 (Other) :146314
## Total.Discharges Average.Covered.Charges Average.Total.Payments
## Min. : 11.00 Min. : 2459 Min. : 2673
## 1st Qu.: 17.00 1st Qu.: 15947 1st Qu.: 5234
## Median : 27.00 Median : 25246 Median : 7214
## Mean : 42.78 Mean : 36134 Mean : 9707
## 3rd Qu.: 49.00 3rd Qu.: 43233 3rd Qu.: 11286
## Max. :3383.00 Max. :929119 Max. :156158
##
## Average.Medicare.Payments Census.Region
## Min. : 1149 MIDWEST :39087
## 1st Qu.: 4192 NORTHEAST:30230
## Median : 6158 SOUTH :67038
## Mean : 8494 WEST :26710
## 3rd Qu.: 10057
## Max. :154621
##
## Census.Region.Division Federal.Region
## D5_SOUTCH_ATLANTIC :34118 REGION_IV :37453
## D3_EAST_NORTH_CENTRAL:27434 REGION_V :29686
## D2_MID-ATLANTIC :21808 REGION_VI :20395
## D7_WEST_SOUTH_CENTRAL:19478 REGION_III:17911
## D9_PACIFIC :17814 REGION_IX :17559
## D6_EAST_SOUTH_CENTRAL:13442 REGION_II :14004
## (Other) :28971 (Other) :26057
## Economic.Analysis.Region
## SOUTHEAST :48468
## GREAT LAKES:27434
## MIDEAST :25994
## FAR WEST :19016
## SOUTHWEST :18152
## PLAINS :11653
## (Other) :12348
Due to the vast array of data and groups found in this particular data frame. We will begin our analysis per regions. We will start with “Census Region” since it encompasses all states within the four central regions of the USA, the divisions within regions, the Federal bank that is in charge of the state as well as the Regions established by the Bureau of Economic Analysis.
This graph contains the number of entry per each of the groups mentioned above.
## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Warning: Ignoring unknown parameters: binwidth, bins, pad
We will start the initial analysis by obtaining the frequency for each one of the variables of the columns:
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
print(count(df_cms, vars = "Census.Region"))
## Census.Region freq
## 1 MIDWEST 39087
## 2 NORTHEAST 30230
## 3 SOUTH 67038
## 4 WEST 26710
print(count(df_cms, vars = "Census.Region.Division"))
## Census.Region.Division freq
## 1 D1_NEW_ENGLAD 8422
## 2 D2_MID-ATLANTIC 21808
## 3 D3_EAST_NORTH_CENTRAL 27434
## 4 D4_WEST_NORTH_CENTRAL 11653
## 5 D5_SOUTCH_ATLANTIC 34118
## 6 D6_EAST_SOUTH_CENTRAL 13442
## 7 D7_WEST_SOUTH_CENTRAL 19478
## 8 D8_MOUNTAIN 8896
## 9 D9_PACIFIC 17814
print(count(df_cms, vars = "Federal.Region"))
## Federal.Region freq
## 1 REGION_I 8422
## 2 REGION_II 14004
## 3 REGION_III 17911
## 4 REGION_IV 37453
## 5 REGION_IX 17559
## 6 REGION_V 29686
## 7 REGION_VI 20395
## 8 REGION_VII 8493
## 9 REGION_VIII 4283
## 10 REGION_X 4859
print(count(df_cms, vars = "Economic.Analysis.Region"))
## Economic.Analysis.Region freq
## 1 FAR WEST 19016
## 2 GREAT LAKES 27434
## 3 MIDEAST 25994
## 4 NEW ENGLAND 8422
## 5 PLAINS 11653
## 6 ROCKY MOUNTAIN 3926
## 7 SOUTHEAST 48468
## 8 SOUTHWEST 18152
In the Census Region, the region South has the highest account of entries with 67038 and the lowest in the West with 26710. In the Divisions in the Census Region, the Division 5 South Atlantic has the hights entries with 34116, and the lowest in the Division 1 New England.
When considering Regions under the different Federal Banks, Region IV is the one with the highest entry and the region with the lowest entries is Region_XIII with 4283. Lastly, The Rocky Mountain Region of the Bureau of Economic Analysis is the region with the lowest entries in 3926. On the other hand, Southeast is the region with the highest entries with 48468.
Before we continue, we will get the summary of the Total Discharges, Average Covered Charges, Average Total Payments and Average Medicare Payments per Census Region.
by(df_cms$Total.Discharges, df_cms$Census.Region, summary)
## df_cms$Census.Region: MIDWEST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 17.00 28.00 43.87 50.00 1487.00
## --------------------------------------------------------
## df_cms$Census.Region: NORTHEAST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 18.00 29.00 45.83 53.00 3383.00
## --------------------------------------------------------
## df_cms$Census.Region: SOUTH
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 17.00 27.00 43.81 50.00 1344.00
## --------------------------------------------------------
## df_cms$Census.Region: WEST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 15.00 23.00 35.13 39.00 1696.00
by(df_cms$Average.Covered.Charges, df_cms$Census.Region, summary)
## df_cms$Census.Region: MIDWEST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3611 14792 21413 29395 34654 353774
## --------------------------------------------------------
## df_cms$Census.Region: NORTHEAST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3481 14964 25870 37123 45428 613927
## --------------------------------------------------------
## df_cms$Census.Region: SOUTH
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2459 15416 23890 33361 40136 480540
## --------------------------------------------------------
## df_cms$Census.Region: WEST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4202 23310 37111 51836 61668 929119
by(df_cms$Average.Total.Payments, df_cms$Census.Region, summary)
## df_cms$Census.Region: MIDWEST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2673 5062 6897 9228 10837 131187
## --------------------------------------------------------
## df_cms$Census.Region: NORTHEAST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2765 5610 7824 10515 12138 140255
## --------------------------------------------------------
## df_cms$Census.Region: SOUTH
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2683 4878 6686 8941 10460 99307
## --------------------------------------------------------
## df_cms$Census.Region: WEST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2995 6134 8525 11419 13295 156158
by(df_cms$Average.Medicare.Payments, df_cms$Census.Region, summary)
## df_cms$Census.Region: MIDWEST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1327 4016 5850 7991 9613 130467
## --------------------------------------------------------
## df_cms$Census.Region: NORTHEAST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1680 4597 6818 9366 11009 133177
## --------------------------------------------------------
## df_cms$Census.Region: SOUTH
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1149 3849 5640 7746 9303 95701
## --------------------------------------------------------
## df_cms$Census.Region: WEST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1604 5034 7429 10125 11900 154621
We will start by analysing the following group of variables together:
The first graph that is going to show the whole data from df_cms. The whole data is graphed in this way in order to acquire the whole pattern of the data between the two variables.
## Warning: Removed 896 rows containing missing values (geom_point).
## Warning: Removed 66 rows containing missing values (geom_point).
## Warning: Removed 262 rows containing missing values (geom_point).
## Warning: Removed 189 rows containing missing values (geom_point).
## Warning: Removed 379 rows containing missing values (geom_point).
As we can see from the graph West and Midwest regions, have a similar scatter plot, flatter and more left-leaning. However, Midwest has a more linear shape than the West region.
Northeast and South regions have similar scatterplots with widening x and y profile than other regions. However, the South region have a straight line. It will be interesting to know why it was developed.
We start the analysis by determining the Correlation between the two variables. We will perform a Pearson test between the two variables without subdividing the data into regions.
res <- cor.test(df_cms$Average.Covered.Charges,
df_cms$Average.Medicare.Payments,
method = "pearson")
res
##
## Pearson's product-moment correlation
##
## data: df_cms$Average.Covered.Charges and df_cms$Average.Medicare.Payments
## t = 485.66, df = 163060, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7669354 0.7709033
## sample estimates:
## cor
## 0.7689268
The test shows a cor value of 0.767. Values between 0.7 and 0.9 state that there is a steady uphill linear relationship between Average Covered Charges and Average Medicare payments.
For the moment we will only explore the correlation between the two variables, and in a subsequent analysis with more detail and in-depth data we will explore other analysis between these two variables
This first graph is going to deliver an overall picture of the relation between the two data columns; including data patterns.
This graph shows that the majority of total discharges are fewer than 250 with an increase average total payments are mostly under 25000$.
The next graph will plot the same graph as above but with limits of x-value 500 and y-value 50000, as well as four separate graphs with the data frame gather selectively from the regions of the USA.
## Warning: Removed 990 rows containing missing values (geom_point).
## Warning: Removed 155 rows containing missing values (geom_point).
## Warning: Removed 279 rows containing missing values (geom_point).
## Warning: Removed 253 rows containing missing values (geom_point).
## Warning: Removed 303 rows containing missing values (geom_point).
All graphs share a similar pattern than the one shared from the pattern mention in the general and global graph. There are two distinct protrusions of data reflecting an increase of total discharges with approximately an average total payments of about 10000; with a constant increase of discharges, similar to a horizontal line. The pattern is repeated about 15000 Average Total Payments; it is less apparent in the West Region.
We start the analysis by determining the Correlation between the two variables. We will perform a Pearson test between the two variables without subdividing the data into regions.
res <- cor.test(df_cms$Total.Discharges, df_cms$Average.Total.Payments,
method = "pearson")
res
##
## Pearson's product-moment correlation
##
## data: df_cms$Total.Discharges and df_cms$Average.Total.Payments
## t = -6.4699, df = 163060, p-value = 9.833e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.02087213 -0.01116732
## sample estimates:
## cor
## -0.0160201
The test shows a cor value of -0.016. Values between 0 and -0.3 state that there is a weak negative linear relationship between Total Discharges and Average Total Payments.
It is unusual that this the correlation found between these variables; especially since these two variables involve the total of discharges and average total payments.
For the moment we will only explore the correlation between the two variables, and in a subsequent analysis with more detail and in-depth data, we will explore other analysis between these two variables.
This first graph is going to deliver an overall picture of the relation between the two data columns; including data patterns.
This graph shows a linear-like pattern. The majority of the entries are within the liner pattern. Also, the majority of entries are within x-value and y-value limits of 50000.
## Warning: Removed 779 rows containing missing values (geom_point).
## Warning: Removed 106 rows containing missing values (geom_point).
## Warning: Removed 243 rows containing missing values (geom_point).
## Warning: Removed 152 rows containing missing values (geom_point).
## Warning: Removed 278 rows containing missing values (geom_point).
All graphs show the same pattern. There are mild differences in the width. Further analysis could reveal more about this pattern.
We start the analysis by determining the Correlation between the two variables. We will perform a Pearson test between the two variables without subdividing the data into regions.
res <- cor.test(df_cms$Average.Total.Payments, df_cms$Average.Medicare.Payments,
method = "pearson")
res
##
## Pearson's product-moment correlation
##
## data: df_cms$Average.Total.Payments and df_cms$Average.Medicare.Payments
## t = 2746.3, df = 163060, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9892588 0.9894642
## sample estimates:
## cor
## 0.989362
The test shows a cor value of 0.989. Values between 0.7 and 0.9 state that there is a steady uphill linear relationship between Total Payments and Average Medicare Payments.
For the moment we will only explore the correlation between the two variables, and in a subsequent analysis with more detail and in-depth data, we will explore other analysis between these two variables.
set.seed(2500)
ggplot(aes(x=Census.Region.Division, fill=Census.Region),data= df_cms)+
geom_boxplot(aes(y= Total.Discharges))+
coord_cartesian(ylim = c(0, 90))+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
The Dataset is one that shows a significant amount of data towards Medicare payment information in hospitals throughout the USA. It consists of 16 variables and 163065 entries. We explore the various relationships between total payments, Medicare payments, Total Discharges, and regions within the USA.
The dataset is vast and much more analysis could be performed, for example, the differences in payment from Medicare and Total payments made per procedure per hospitals in the same region as well as through states or regions in the USA.